I10.9
I10.9 Consider the USArrests data. We will now perform hierarchical clustering on the states.
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
set.seed(1995)
dat1 <- USArrests
(a)
fit.hclust <- hclust(dist(dat1), method = "complete")
plot(fit.hclust, main = "States clusters")

(b)
fit.hclust.c <- cutree(fit.hclust, 3)
fit.hclust.c
       Alabama         Alaska        Arizona       Arkansas     California
             1              1              1              2              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              3              1              1              2
        Hawaii          Idaho       Illinois        Indiana           Iowa
             3              3              1              3              3
        Kansas       Kentucky      Louisiana          Maine       Maryland
             3              3              1              3              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             2              1              3              1              2
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             3              3              1              3              2
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              1              3              3
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             2              2              3              2              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             3              2              2              3              3
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             2              2              3              3              2
fit.hclust.c %>% table() %>% t() %>% # xtable() %>%
kable(caption = "Number of states in each cluster") %>% kable_styling(position = "left",
bootstrap_options = "hover")
Number of states in each cluster
| 1 | 2 | 3 |
|---|---|---|
| 16 | 14 | 20 |
(c)
dat2 <- scale(dat1)
fit.hclust.s <- hclust(dist(dat2), method = "complete")
plot(fit.hclust.s)

(d)
fit.hclust.s.c <- cutree(fit.hclust.s, 3)
fit.hclust.s.c
       Alabama         Alaska        Arizona       Arkansas     California
             1              1              2              3              2
      Colorado    Connecticut       Delaware        Florida        Georgia
             2              3              3              2              1
        Hawaii          Idaho       Illinois        Indiana           Iowa
             3              3              2              3              3
        Kansas       Kentucky      Louisiana          Maine       Maryland
             3              3              1              3              2
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             3              2              3              1              3
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             3              3              2              3              3
    New Mexico       New York North Carolina   North Dakota           Ohio
             2              2              1              3              3
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             3              3              3              3              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             3              1              2              3              3
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             3              3              3              3              3
fit.hclust.s.c %>% table() %>% t() %>% # xtable() %>%
kable(caption = "Number of states in each cluster (scaled data)") %>% kable_styling(position = "left",
bootstrap_options = "hover")
Number of states in each cluster (scaled data)
| 1 | 2 | 3 |
|---|---|---|
| 8 | 11 | 31 |
dd <- table(fit.hclust.c, fit.hclust.s.c)
rownames(dd) <- c("Non-scaled 1", "Non-scaled 2", "Non-scaled 3")
colnames(dd) <- c("Scaled 1", "Scaled 2", "Scaled 3")
dd %>% # xtable() %>%
kable(caption = "Model comparison on non-scaled vs scaled data") %>% kable_styling(position = "left",
bootstrap_options = "hover") %>% column_spec(1, bold = TRUE)
Model comparison on non-scaled vs scaled data
| | Scaled 1 | Scaled 2 | Scaled 3 |
|---|---|---|---|
| **Non-scaled 1** | 6 | 9 | 1 |
| **Non-scaled 2** | 2 | 2 | 10 |
| **Non-scaled 3** | 0 | 0 | 20 |
- Scaling changes the height scale of the dendrogram (dissimilarities are much smaller after standardisation), and it also changes the clustering itself: the trees look broadly similar, but the three-cluster cut assigns many states differently, as the comparison table above shows.
- The variables should be scaled before the dissimilarities are computed, because they are measured in different units: UrbanPop is a percentage, while the three crime variables are rates per 100,000 residents. Without scaling, Assault dominates the Euclidean distance simply because it has by far the largest variance.
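This can be verified directly in base R: Assault's standard deviation is an order of magnitude larger than that of the other variables, so it dominates unscaled Euclidean distances.

```r
# Standard deviations of the raw USArrests variables: Assault is far
# larger than Murder, UrbanPop and Rape, so it dominates unscaled distances.
apply(USArrests, 2, sd)
# After scaling, every variable contributes on the same footing (sd = 1).
apply(scale(USArrests), 2, sd)
```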
I4.13
I4.13 Using the Boston data set, fit classification models in order to predict whether a given suburb has a crime rate above or below the median. Explore logistic regression, LDA, and KNN models using various subsets of the predictors. Describe your findings.
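The chunks below use a data set `dat` with a binary response `crim.ind` that is never defined in this section. A minimal sketch of how it may have been built (the construction itself is an assumption; only the names `dat` and `crim.ind` come from the code below):

```r
library(MASS)  # Boston data set

dat <- Boston
# 1 if the suburb's crime rate is above the median, 0 otherwise
dat$crim.ind <- as.integer(dat$crim > median(dat$crim))
# drop the raw rate so it cannot leak into the classifiers
dat$crim <- NULL
```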
Create train and test sets
set.seed(1995)
train_ind <- sample(nrow(dat), nrow(dat) * 0.7)
train <- as.data.table(dat[train_ind, ])
test <- as.data.table(dat[-train_ind, ])
train[, .N, crim.ind] %>% kable(caption = "Crime rate comparison to median (train set)") %>%
kable_styling(position = "left", bootstrap_options = "hover")
Crime rate comparison to median (train set)
| crim.ind | N |
|---|---|
| 1 | 184 |
| 0 | 170 |
test[, .N, crim.ind] %>% kable(caption = "Crime rate comparison to median (test set)") %>%
kable_styling(position = "left", bootstrap_options = "hover")
Crime rate comparison to median (test set)
| crim.ind | N |
|---|---|
| 0 | 83 |
| 1 | 69 |
Logistic regression
All variables
fit.glm <- glm(crim.ind ~ ., data = train, family = binomial)
test$predict <- predict(fit.glm, test, type = "response")
test[, `:=`(crim.p, fifelse(predict < 0.5, 0, 1))]
table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 75 | 8 |
| **1** | 7 | 62 |
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 90.1 %; Test error rate: 9.9 %
fit.glm %>% summary() %>% pander()
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -32.39 | 10.52 | -3.08 | 0.002071 |
| zn | -0.1067 | 0.04466 | -2.389 | 0.01687 |
| indus | -0.08002 | 0.06342 | -1.262 | 0.207 |
| chas | 0.8943 | 0.9275 | 0.9642 | 0.3349 |
| nox | 54.91 | 10.2 | 5.385 | 7.251e-08 |
| rm | -0.1994 | 0.9313 | -0.2141 | 0.8305 |
| age | 0.03965 | 0.01722 | 2.302 | 0.02132 |
| dis | 0.958 | 0.3168 | 3.024 | 0.002497 |
| rad | 0.7909 | 0.2016 | 3.922 | 8.77e-05 |
| tax | -0.007253 | 0.003429 | -2.115 | 0.03439 |
| ptratio | 0.4935 | 0.1836 | 2.687 | 0.007209 |
| black | -0.04434 | 0.01821 | -2.434 | 0.01492 |
| lstat | 0.03954 | 0.0695 | 0.5689 | 0.5694 |
| medv | 0.2064 | 0.09856 | 2.094 | 0.03625 |

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 490.2 on 353 degrees of freedom
Residual deviance: 128.0 on 340 degrees of freedom
Excluding non significant variables
fit.glm2 <- glm(crim.ind ~ . - indus - chas - rm - lstat, data = train, family = binomial)
test$predict <- predict(fit.glm2, test, type = "response")
test[, `:=`(crim.p, fifelse(predict < 0.5, 0, 1))]
table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 72 | 11 |
| **1** | 7 | 62 |
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 88.2 %; Test error rate: 11.8 %
fit.glm2 %>% summary() %>% pander()
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -30.07 | 9.982 | -3.013 | 0.002588 |
| zn | -0.111 | 0.04267 | -2.601 | 0.009286 |
| nox | 49.15 | 8.923 | 5.508 | 3.631e-08 |
| age | 0.0401 | 0.01411 | 2.842 | 0.004478 |
| dis | 0.8805 | 0.2954 | 2.981 | 0.002872 |
| rad | 0.9039 | 0.1925 | 4.695 | 2.67e-06 |
| tax | -0.00912 | 0.003004 | -3.036 | 0.002398 |
| ptratio | 0.4294 | 0.1558 | 2.757 | 0.005834 |
| black | -0.04029 | 0.01644 | -2.45 | 0.01427 |
| medv | 0.1729 | 0.05443 | 3.177 | 0.001488 |

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 490.2 on 353 degrees of freedom
Residual deviance: 130.7 on 344 degrees of freedom
LDA
All variables
fit.lda <- lda(crim.ind ~ ., data = train)
test$crim.p <- predict(fit.lda, test, type = "response")$class
table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 77 | 6 |
| **1** | 11 | 58 |
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 88.8 %; Test error rate: 11.2 %
fit.lda %>% summary() %>% pander()
| | Length | Class | Mode |
|---|---|---|---|
| prior | 2 | -none- | numeric |
| counts | 2 | -none- | numeric |
| means | 26 | -none- | numeric |
| scaling | 13 | -none- | numeric |
| lev | 2 | -none- | character |
| svd | 1 | -none- | numeric |
| N | 1 | -none- | numeric |
| call | 3 | -none- | call |
| terms | 3 | terms | call |
| xlevels | 0 | -none- | list |
Excluding some variables
fit.lda2 <- lda(crim.ind ~ . - indus - chas - rm - lstat, data = train)
test$crim.p <- predict(fit.lda2, test, type = "response")$class
table(test$crim.ind, test$crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 77 | 6 |
| **1** | 10 | 59 |
acc <- round(mean(test$crim.p == test$crim.ind), 3) * 100
err <- round(mean(test$crim.p != test$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 89.5 %; Test error rate: 10.5 %
fit.lda2 %>% summary() %>% pander()
| | Length | Class | Mode |
|---|---|---|---|
| prior | 2 | -none- | numeric |
| counts | 2 | -none- | numeric |
| means | 18 | -none- | numeric |
| scaling | 9 | -none- | numeric |
| lev | 2 | -none- | character |
| svd | 1 | -none- | numeric |
| N | 1 | -none- | numeric |
| call | 3 | -none- | call |
| terms | 3 | terms | call |
| xlevels | 0 | -none- | list |
KNN
# select features
nm <- which(!names(dat) %in% c("indus", "chas", "rm", "lstat"))
dat.knn <- copy(dat[, nm, with = FALSE])
# transform variables
nm <- names(dat.knn)
nm <- nm[!grepl("crim.ind", nm)]
dat.knn[, `:=`((nm), lapply(.SD, scale)), .SDcols = nm]
# split train - test sets
train.knn <- dat.knn[train_ind, ]
test.knn <- dat.knn[-train_ind, ]
K=1
# KNN with K=1
crim.p <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind],
k = 1)
table(test.knn[, crim.ind], crim.p) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 75 | 8 |
| **1** | 4 | 65 |
acc <- round(mean(crim.p == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 92.1 %; Test error rate: 7.9 %
K=3
# KNN with K=3
crim.p3 <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind],
k = 3)
table(test.knn[, crim.ind], crim.p3) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 78 | 5 |
| **1** | 5 | 64 |
acc <- round(mean(crim.p3 == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p3 != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 93.4 %; Test error rate: 6.6 %
K=5
# KNN with K=5
crim.p5 <- knn(train.knn[, !"crim.ind"], test.knn[, !"crim.ind"], train.knn[, crim.ind],
k = 5)
table(test.knn[, crim.ind], crim.p5) %>% # xtable() %>%
kable(caption = "Actual vs predicted") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Actual vs predicted
| Actual \ Predicted | 0 | 1 |
|---|---|---|
| **0** | 81 | 2 |
| **1** | 4 | 65 |
acc <- round(mean(crim.p5 == test.knn$crim.ind), 3) * 100
err <- round(mean(crim.p5 != test.knn$crim.ind), 3) * 100
cat("Test accuracy:", acc, "%; \n", "Test error rate:", err, "% \n")
Test accuracy: 96.1 %; Test error rate: 3.9 %
KNN performed best: with K = 5 the test error rate drops to 3.9%, below both logistic regression and LDA on the same feature subset.
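Instead of trying K = 1, 3, 5 by hand, the search over K can be automated. A self-contained sketch mirroring the feature selection and 70/30 split above (the odd-K grid and the base-R rebuild of the scaled feature matrix are assumptions):

```r
library(MASS)   # Boston data
library(class)  # knn()

set.seed(1995)
dat <- Boston
dat$crim.ind <- as.integer(dat$crim > median(dat$crim))
# same predictors as above: drop indus, chas, rm, lstat (and the response)
keep <- !(names(dat) %in% c("crim", "crim.ind", "indus", "chas", "rm", "lstat"))
X <- scale(dat[, keep])
train_ind <- sample(nrow(dat), floor(nrow(dat) * 0.7))

ks <- seq(1, 15, by = 2)  # odd K avoids ties in the majority vote
err <- sapply(ks, function(k) {
  pred <- knn(X[train_ind, ], X[-train_ind, ], dat$crim.ind[train_ind], k = k)
  mean(pred != dat$crim.ind[-train_ind])
})
data.frame(k = ks, test.error = err)
```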
I9.5
I9.5 We have seen that we can fit an SVM with a non-linear kernel in order to perform classification using a non-linear decision boundary. We will now see that we can also obtain a non-linear decision boundary by performing logistic regression using non-linear transformations of the features.
(a) Generate a data set with n = 500 and p = 2, such that the observations belong to two classes with a quadratic decision boundary between them. For instance, you can do this as follows:
> x1=runif (500) -0.5
> x2=runif (500) -0.5
> y=1*( x1^2-x2^2 > 0)
(b) Plot the observations, colored according to their class labels. Your plot should display X1 on the x-axis, and X2 on the y-axis.
(c) Fit a logistic regression model to the data, using X1 and X2 as predictors.
(d) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be linear.
(e) Now fit a logistic regression model to the data using non-linear functions of \(X_1\) and \(X_2\) as predictors (e.g. \(X^2_1\) , \(X_1 \times X_2\), \(log(X_2)\), and so forth).
(f) Apply this model to the training data in order to obtain a predicted class label for each training observation. Plot the observations, colored according to the predicted class labels. The decision boundary should be obviously non-linear. If it is not, then repeat (a)-(e) until you come up with an example in which the predicted class labels are obviously non-linear.
(g) Fit a support vector classifier to the data with \(X_1\) and \(X_2\) as predictors. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(h) Fit a SVM using a non-linear kernel to the data. Obtain a class prediction for each training observation. Plot the observations, colored according to the predicted class labels.
(i) Comment on your results.
(a)
set.seed(1995)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y <- 1 * (x1^2 - x2^2 > 0)
(b)
dat9 <- data.table(as.factor(y), x1, x2)
# plot_ly(dd, x=~x1) %>% add_markers(y = ~x2, showlegend = FALSE)
fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y, symbol = ~y, type = "scatter",
mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Y class: ",
y, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2, 3)))
fig <- fig %>% layout(title = "X1 vs X2", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig
(c)
fit.glm9 <- glm(y ~ x1 + x2, family = binomial)
fit.glm9 %>% summary() %>% pander()
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | 0.06214 | 0.08976 | 0.6922 | 0.4888 |
| x1 | -0.05384 | 0.3139 | -0.1715 | 0.8638 |
| x2 | -0.3561 | 0.3165 | -1.125 | 0.2604 |

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 692.8 on 499 degrees of freedom
Residual deviance: 691.5 on 497 degrees of freedom
(d)
dat9$predict <- predict(fit.glm9, dat9, type = "response")
dat9[, `:=`(y.p, fifelse(predict < 0.5, 0, 1))]
fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p, symbol = ~y.p, type = "scatter",
mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ",
y, "</br> Predicted: ", y.p, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2,
3)))
fig <- fig %>% layout(title = "Logistic regression results", xaxis = list(title = "X1"),
yaxis = list(title = "X2"))
fig
(e)
fit.glm92 <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2), data = dat9, family = binomial)
fit.glm92 %>% summary() %>% pander()
| | Estimate | Std. Error | z value | Pr(>\|z\|) |
|---|---|---|---|---|
| (Intercept) | -5.727 | 478 | -0.01198 | 0.9904 |
| x1 | 48.12 | 7921 | 0.006075 | 0.9952 |
| x2 | 37.57 | 9631 | 0.003901 | 0.9969 |
| I(x1^2) | 25208 | 597146 | 0.04221 | 0.9663 |
| I(x2^2) | -24493 | 579277 | -0.04228 | 0.9663 |
| I(x1 * x2) | 229 | 30926 | 0.007405 | 0.9941 |

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 6.928e+02 on 499 degrees of freedom
Residual deviance: 6.313e-06 on 494 degrees of freedom

The near-zero residual deviance together with the huge standard errors signals perfect separation: the quadratic terms can reproduce the true boundary exactly, so the maximum-likelihood coefficients diverge.
(f)
dat9$predict2 <- predict(fit.glm92, dat9, type = "response")
dat9[, `:=`(y.p2, fifelse(predict2 < 0.5, 0, 1))]
fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p2, symbol = ~y.p2, type = "scatter",
mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ",
y, "</br> Predicted: ", y.p2, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2,
3)))
fig <- fig %>% layout(title = "Logistic regression results", xaxis = list(title = "X1"),
yaxis = list(title = "X2"))
fig
(g)
fit.svm <- svm(as.factor(y) ~ x1 + x2, dat9, kernel = "linear", cost = 0.01)
dat9$y.p3 <- predict(fit.svm, dat9, type = "response")
fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p3, symbol = ~y.p3, type = "scatter",
mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ",
y, "</br> Predicted: ", y.p3, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2,
3)))
fig <- fig %>% layout(title = "SVM results", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig
(h)
fit.svm2 <- svm(as.factor(y) ~ x1 + x2, dat9, kernel = "radial", gamma = 1)
dat9$y.p4 <- predict(fit.svm2, dat9, type = "response")
fig <- plot_ly(dat9, x = ~x1, y = ~x2, color = ~y.p4, symbol = ~y.p4, type = "scatter",
mode = "markers", hoverinfo = "text", marker = list(size = 7), text = ~paste("</br> Real: ",
y, "</br> Predicted: ", y.p4, "</br> X1: ", round(x1, 3), "</br> X2: ", round(x2,
3)))
fig <- fig %>% layout(title = "SVM results", xaxis = list(title = "X1"), yaxis = list(title = "X2"))
fig
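Part (i) asks for a comment on the results. The plots show it qualitatively; the same contrast can be made numerical with a self-contained base-R check (regenerating the data with the same seed; only glm is used so the sketch does not depend on e1071):

```r
set.seed(1995)
x1 <- runif(500) - 0.5
x2 <- runif(500) - 0.5
y  <- 1 * (x1^2 - x2^2 > 0)

# linear terms only: a straight decision boundary cannot track x1^2 = x2^2
fit.lin  <- glm(y ~ x1 + x2, family = binomial)
# quadratic terms: the true boundary lies inside the model space, so the
# classes are (almost) perfectly separated on the training data
fit.quad <- glm(y ~ x1 + x2 + I(x1^2) + I(x2^2), family = binomial)

train.acc <- function(fit) mean((fitted(fit) > 0.5) == y)
c(linear = train.acc(fit.lin), quadratic = train.acc(fit.quad))
```

The linear-kernel SVM behaves like the linear logistic fit and the radial-kernel SVM like the quadratic one: only models whose hypothesis space contains a quadratic boundary can recover it.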
Bonus Tasks
1. Text document analysis
- Select any document you have written yourself, preferably in English (Lithuanian is possible in UTF-8 encoding), or an article in econometrics. Write R code that selects the most frequent words in the paper as a whole and in each subsection. Draw a visual representation of the words. Can the topic of a subsection be guessed from the visual representation of its most frequent words? Are the topics related by some keywords?
Words frequency
dat.pdf <- Corpus(URISource("straipsnis.pdf"), readerControl = list(reader = readPDF))
dtm <- TermDocumentMatrix(dat.pdf, control = list(removePunctuation = TRUE, stopwords = TRUE,
tolower = TRUE, stemming = TRUE, removeNumbers = TRUE, bounds = list(global = c(1,
Inf))))
mm <- as.matrix(dtm)
ww <- sort(rowSums(mm), decreasing = TRUE)
dd <- data.frame(word = names(ww), freq = ww)
head(dd, 10) %>% kable(caption = "10 most popular words") %>% kable_styling(position = "left",
bootstrap_options = "hover")
10 most popular words
| word | freq |
|---|---|
| path | 32 |
| weight | 30 |
| rout | 29 |
| algorithm | 24 |
| shortest | 24 |
| time | 21 |
| road | 20 |
| electr | 18 |
| energi | 17 |
| vehicl | 17 |
Words frequency plot
set.seed(1995)
wordcloud(words = dd$word, freq = dd$freq, min.freq = 5, max.words = 200, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(8, "Dark2"))

The paper is about an application of Dijkstra's algorithm (weighted shortest paths) to electric-vehicle route planning, and the word cloud reflects this well: path, weight, rout(e), algorithm and shortest dominate the frequencies.
Words frequency plot by section
dat.pdf.c <- VCorpus(VectorSource(pdf_text("straipsnis.pdf")))
dtm.c <- TermDocumentMatrix(dat.pdf.c, control = list(removePunctuation = TRUE, stopwomm2rds = TRUE,
tolower = TRUE, stemming = TRUE, removeNumbers = TRUE, bounds = list(global = c(1,
Inf))))
mm2 <- as.matrix(dtm.c)
for (i in 1:(ncol(mm2))) {
stat <- mm2[, i]
v <- sort(stat, decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
wordcloud(words = d$word, freq = d$freq, min.freq = 1, max.words = 20, random.order = FALSE,
rot.per = 0.35, colors = brewer.pal(8, "Dark2"))
}
(seven word-cloud plots, one per section)
The per-section word clouds are not very informative on their own, but the general idea of each subsection can still be recognised from them.
2. Barro-Lee example
- Use library(hdm) in R. Work with the Barro-Lee example on the convergence dataset, randomly dropping a subset of 20 countries. For the random generator use set.seed(yyyymmdd), where the digits are your birth date. Compute the post square-root LASSO; compute the full OLS estimate; perform a double selection with both the partialling-out and the double-selection routines. Compare the results for the inference on the treatment variable (the third variable in the dataset), which reflects the conditional convergence hypothesis. Apply the same estimations to the full dataset. Are the results robust to the exclusion of the 20 countries? For additional details see p. 12 of vignette('hdm_vignette').
Dropping 20 countries
set.seed(19950330)
dat.b <- GrowthData
setDT(dat.b)
drop.ind <- sample(1:(nrow(dat.b)), 20)
dat.b <- dat.b[!drop.ind, ]
dim.b <- dim(dat.b)
cat("Data dimensions after dropping 20 countries: rows:", dim.b[1], "; columns:",
dim.b[2], ". \n")
Data dimensions after dropping 20 countries: rows: 70 ; columns: 63 .
# from vignette
y <- dat.b[, 1, drop = F]
d <- dat.b[, 3, drop = F]
X <- as.matrix(dat.b)[, -c(1, 2, 3)]
varnames <- colnames(dat.b)
# OLS:
xnames <- varnames[-c(1, 2, 3)]
dandxnames <- varnames[-c(1, 2)]
# create formulas by pasting names (this saves typing times)
fmla <- as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect <- lm(fmla, data = dat.b)
# effect by the partialling out by Post-Lasso:
lasso.effect <- rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)
[1] "Estimates and significance testing of the effect of target variables"
     Estimate. Std. Error t value Pr(>|t|)
[1,]  -0.03932    0.02066  -1.903    0.057 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# double selection:
doublesel.effect <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)
gdpsh465  -0.03932    0.02097  -1.875   0.0608 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table <- rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,
1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) <- c("Estimate", "Std. Error")
rownames(table) <- c("OLS", "Post-Lasso ", "Double selection")
table %>% # xtable(digits=c(2, 2,5))%>% xtable() %>%
kable(caption = "Models comparison") %>% kable_styling(position = "left", bootstrap_options = "hover") %>%
column_spec(1, bold = T)
Models comparison
| | Estimate | Std. Error |
|---|---|---|
| **OLS** | -0.0287775 | 0.0477123 |
| **Post-Lasso** | -0.0393177 | 0.0206559 |
| **Double selection** | -0.0393177 | 0.0209711 |
Full dataset
dat.b.f <- GrowthData
setDT(dat.b.f)
dim.b.f <- dim(dat.b.f)
cat("Full dataset dimensions: rows:", dim.b.f[1], "; columns:", dim.b.f[2], ". \n")
Full dataset dimensions: rows: 90 ; columns: 63 .
# from vignette
y <- dat.b.f[, 1, drop = F]
d <- dat.b.f[, 3, drop = F]
X <- as.matrix(dat.b.f)[, -c(1, 2, 3)]
varnames <- colnames(dat.b.f)
# OLS:
xnames <- varnames[-c(1, 2, 3)]
dandxnames <- varnames[-c(1, 2)]
# create formulas by pasting names (this saves typing times)
fmla <- as.formula(paste("Outcome ~ ", paste(dandxnames, collapse = "+")))
ls.effect <- lm(fmla, data = dat.b.f)
# effect by the partialling out by Post-Lasso:
lasso.effect <- rlassoEffect(x = X, y = y, d = d, method = "partialling out")
summary(lasso.effect)
[1] "Estimates and significance testing of the effect of target variables"
     Estimate. Std. Error t value Pr(>|t|)
[1,]  -0.04981    0.01394  -3.574 0.000351 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# double selection:
doublesel.effect <- rlassoEffect(x = X, y = y, d = d, method = "double selection")
summary(doublesel.effect)
[1] "Estimates and significance testing of the effect of target variables"
         Estimate. Std. Error t value Pr(>|t|)
gdpsh465  -0.05001    0.01579  -3.167  0.00154 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table <- rbind(summary(ls.effect)$coef["gdpsh465", 1:2], summary(lasso.effect)$coef[,
1:2], summary(doublesel.effect)$coef[, 1:2])
colnames(table) <- c("Estimate", "Std. Error")
rownames(table) <- c("OLS", "Post-Lasso ", "Double selection")
table %>% kable(caption = "Models comparison") %>% kable_styling(position = "left",
bootstrap_options = "hover") %>% column_spec(1, bold = T)
Models comparison
| | Estimate | Std. Error |
|---|---|---|
| **OLS** | -0.0093780 | 0.0298877 |
| **Post-Lasso** | -0.0498115 | 0.0139364 |
| **Double selection** | -0.0500059 | 0.0157914 |

Comparing the two tables: after dropping 20 countries the Post-Lasso and double-selection estimates keep the same sign and similar magnitude (about -0.039 versus -0.050 on the full data) but fall from the 1% significance level to borderline 10% significance, while the OLS estimate changes far more and is insignificant in both samples. The Lasso-based inference on the conditional convergence coefficient is therefore reasonably robust to excluding the 20 countries, whereas OLS is not informative in either case.